Integrative analysis of multi-omics data reveals importance of collagen and the PI3K-AKT signalling pathway in CAKUT

Congenital Anomalies of the Kidney and Urinary Tract (CAKUT) is the leading cause of childhood chronic kidney failure and a significant cause of chronic kidney disease in adults. Genetic and environmental factors are known to influence CAKUT development, but the current understanding of the disease mechanism remains incomplete. Our goal is to identify affected pathways and networks in CAKUT, and thereby contribute to a better understanding of its pathophysiology. To this end, the miRNome, peptidome, and proteome of over 30 amniotic fluid samples of patients with non-severe CAKUT were compared to those of patients with severe CAKUT. These omics data sets were made findable, accessible, interoperable, and reusable (FAIR) to facilitate their integration with external data resources. Furthermore, we analysed and integrated the omics data sets using three different bioinformatics strategies: integrative analysis with mixOmics, joint dimensionality reduction, and pathway analysis. The three bioinformatics analyses provided complementary features, but all pointed towards an important role for collagen and the PI3K-AKT signalling pathway in CAKUT development. Additionally, several key genes (CSF1, IGF2, ITGB1, and RAC1) and microRNAs were identified. We published the three analysis strategies as containerized workflows. These workflows can be applied to other FAIR data sets and help gain knowledge on other rare diseases.


Multi-omics integrative analysis with mixOmics
As a first approach to analysing the CAKUT multi-omics data, we combined the miRNome and peptidome data using the mixOmics package 5. This approach identifies common patterns among multiple omics data sets by projecting the data into a small number of dimensions, where the number of dimensions (components) can be specified. Only the samples that matched between the two omics data sets and were in the training cohort of the peptidome study 4 were used (n = 46; 30 non-severe and 16 severe CAKUT cases). This restriction follows from the supervised classification method of the mixOmics package, which requires matching samples among omics data sets. The proteomics data were not used in the mixOmics analysis, because only a limited number of samples matched the miRNome and peptidome data (Fig. 1).
As part of the mixOmics analysis, Partial Least-Squares Discriminant Analysis (PLS-DA) and sparse PLS-DA (sPLS-DA) were used to identify a subset of variables that could explain the variability between non-severe

Fig. 1. Analysis samples. Specification of the number of samples from each of the three omics data sets that was used for the three bioinformatics strategies (mixOmics, momix, and pathway analysis). For the mixOmics and the momix analysis, the number of samples was reduced, since these methods required samples from the same patient that matched between the different omics data sets. For the pathway analysis methods, all samples with sufficient clinical data were used for analysis. The results of each strategy are highlighted.

Joint multi-omics dimensionality reduction analysis
In the second strategy, we applied eight different unsupervised joint dimensionality reduction methods to the peptidome, proteome, and miRNome data using the momix notebook 7. We used the 31 samples (18 non-severe CAKUT cases and 13 severe CAKUT cases) that matched between the three omics data sets. A joint dimensionality reduction method decomposes the omics data sets into omics-specific weight matrices and a joint factor matrix. We ran the dimensionality reduction methods to obtain the two most important factors (k = 2). Most non-severe and severe CAKUT patients could be separated by one of these two factors (Fig. 3A-C). To evaluate the methods and choose the most relevant factor, we measured how well the two sample groups could be clustered. For each method and each factor, we used k-means clustering. We ran k-means 1000 times and counted the number of samples that were in the correct cluster in accordance with the clinical diagnosis. The baseline accuracy is 58% (18 out of 31), which can be obtained by assigning all the samples to one of the two clusters. The accuracies of the joint dimensionality reduction methods range from 65 to 90% when the better-segregating of the two factors is considered (Table 2).
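The clustering-based evaluation described above can be sketched as follows. This is a minimal illustration rather than the code of the momix notebook: it assumes a plain one-dimensional 2-means (Lloyd's algorithm), hypothetical factor values, and a definition of "correct cluster" that takes the better of the two possible cluster-to-diagnosis assignments. The function names are illustrative only.

```python
import random


def two_means_1d(values, rng, max_iter=100):
    """Cluster 1-D values into two groups with Lloyd's algorithm (k = 2)."""
    c0, c1 = rng.sample(values, 2)  # two randomly chosen samples as start centroids
    labels = None
    for _ in range(max_iter):
        new_labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in values]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        if group0:
            c0 = sum(group0) / len(group0)
        if group1:
            c1 = sum(group1) / len(group1)
    return labels


def clustering_accuracy(labels, diagnosis):
    """Fraction of samples whose cluster agrees with the diagnosis.

    Cluster ids are arbitrary, so the better of the two possible
    cluster-to-diagnosis assignments is used (an assumption of this sketch).
    """
    agree = sum(1 for lab, dx in zip(labels, diagnosis) if lab == dx)
    return max(agree, len(labels) - agree) / len(labels)


def mean_accuracy(values, diagnosis, runs=1000, seed=0):
    """Average accuracy over repeated k-means runs with random starts."""
    rng = random.Random(seed)
    total = sum(
        clustering_accuracy(two_means_1d(values, rng), diagnosis)
        for _ in range(runs)
    )
    return total / runs


def majority_baseline(diagnosis):
    """Accuracy obtained by putting every sample into a single cluster."""
    severe = sum(diagnosis)
    return max(severe, len(diagnosis) - severe) / len(diagnosis)


# 0 = non-severe, 1 = severe; group sizes are taken from the text.
diagnosis = [0] * 18 + [1] * 13
# Hypothetical, well-separated factor values, for illustration only.
factor = [-1.0 + 0.01 * i for i in range(18)] + [1.0 + 0.01 * i for i in range(13)]

baseline = majority_baseline(diagnosis)  # 18 / 31, about 0.58
accuracy = mean_accuracy(factor, diagnosis)
```

With real factor values the two groups overlap, so repeated runs with different starting centroids can yield different partitions, which is why averaging over many runs is useful.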
Based on the accuracy, we selected the three methods that were the most successful in separating non-severe and severe CAKUT patients, namely RGCCA, tICA, and MOFA (Fig. 3A-C). Within the weight matrices created by these methods, we used the weight vectors corresponding to the better of the two factors. We then used the absolute value of the weights assigned to the features and selected the top 5% of peptides, proteins, and miRNAs from each method for further analysis (Supplementary Tables 2-4).
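Selecting the top 5% of features by absolute weight can be sketched as below. The feature names and weights are hypothetical, and the rounding rule (rounding the cut-off up, keeping at least one feature) is an assumption of this sketch, since the text does not specify it.

```python
import math


def top_fraction_by_abs_weight(weights, fraction=0.05):
    """Return the feature names with the largest absolute weights.

    `weights` maps a feature name to its weight on the chosen factor.
    The number of features kept is rounded up (assumption of this sketch).
    """
    n_keep = max(1, math.ceil(fraction * len(weights)))
    ranked = sorted(weights, key=lambda name: abs(weights[name]), reverse=True)
    return ranked[:n_keep]


# 40 hypothetical features with alternating signs and distinct magnitudes.
weights = {f"f{i}": (-1) ** i * (i + 1) for i in range(40)}
selected = top_fraction_by_abs_weight(weights)  # 5% of 40 features = 2 features
```

Using the absolute value keeps features that contribute strongly to the factor in either direction.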
We focused on the peptides and proteins identified in the top 5% by all three methods, and on miRNAs identified by two methods, as no miRNA was common to all three methods (Fig. 3D-F). This resulted in 106 peptides, 16 proteins, and 13 miRNAs. These 106 peptides correspond to 15 proteins, mainly collagens: COL1A1, COL1A2, COL2A1, COL3A1, COL4A2, COL4A5, COL5A2, COL6A3, COL8A1, COL9A1, COL17A1, SLC17A6, COL18A1, COL22A1, and CP. None of these 15 proteins were identified in the top 5% of the proteome; however, some corresponded to related proteins. For instance, cadherins (CDH6, CDH9, CDH10) and CADM4 play a role in calcium-dependent cell adhesion. Furthermore, ROBO4, UMOD, HABP2, MADCAM1, HMCN1, and EPHB2 have been indicated to be involved in cell adhesion, cell junctions, and/or the migration of one or more specific cell types. Overall, this indicates that the peptidome and the proteome identify different proteins but similar processes.
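The overlap step (features selected by all three methods, or by at least two methods for miRNAs) amounts to counting in how many per-method selections each feature appears. A minimal sketch, using hypothetical feature identifiers rather than the actual selections:

```python
from collections import Counter


def features_selected_by_at_least(selections, min_methods):
    """Return the features that occur in at least `min_methods` selections.

    `selections` maps a method name to the collection of features it selected.
    """
    counts = Counter(
        feature for features in selections.values() for feature in set(features)
    )
    return sorted(f for f, n in counts.items() if n >= min_methods)


# Hypothetical selections, for illustration only.
selections = {
    "RGCCA": {"pepA", "pepB", "miR-x"},
    "tICA": {"pepA", "miR-x", "miR-y"},
    "MOFA": {"pepA", "pepB", "miR-y"},
}

in_all_three = features_selected_by_at_least(selections, 3)
in_two_or_more = features_selected_by_at_least(selections, 2)
```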
We performed enrichment analysis to identify the most important biological processes associated with the selected peptides, proteins, and miRNAs (Supplementary Tables 2-4). In this analysis, we used the proteins selected from the proteome data, the proteins corresponding to the selected peptides, and the genes targeted by the selected miRNAs. We used orsum 8 to present the enrichment results and to filter redundant annotation terms (Fig. 3G). Five Gene Ontology Biological Process (GO-BP) terms are significantly enriched in both the miRNome and peptidome data, mainly indicating misregulation of organ structure and development in non-severe CAKUT patients versus severe CAKUT patients. "Cell Adhesion" (GO:0007155) is the only significantly enriched GO-BP term in the proteomics data. Proteins corresponding to the selected peptides are further enriched in extracellular processes, including the process entitled "collagen-activated tyrosine kinase receptor signaling pathway" (GO:0038063). Cell adhesion and collagen related pathways are also significant when REACTOME pathways are used in the enrichment analysis (Fig. 3H). Finally, for the miRNome data, the REACTOME enrichment analysis of the genes targeted by the selected miRNAs mainly revealed rRNA and transcription processes. The GO-BP enrichment analyses indicated a role for the miRNA regulated genes in metabolomics and biosynthesis, for which misregulation could affect organ structure and development.

Table 1. Contribution scores per omics data for each of the two principal components of the principal component analysis, where sPLS-DA, a variable selection method, was applied to select the optimal number of peptides and miRNAs.
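For readers unfamiliar with enrichment analysis, the idea of testing whether an annotation term is over-represented in a selected gene list can be illustrated with a one-sided hypergeometric test. The text does not state which statistical test the enrichment tools used here, so this is a generic sketch with hypothetical counts, not a description of the actual procedure.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_term, n_selected, n_overlap):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    n_universe: number of genes in the background
    n_term:     number of background genes annotated with the term
    n_selected: number of genes in the selected list
    n_overlap:  number of selected genes annotated with the term
    """
    upper = min(n_term, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_term, k) * comb(n_universe - n_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 background genes, 5 annotated, 5 selected, all 5 overlap.
p_value = hypergeom_enrichment_p(10, 5, 5, 5)
```

In practice such p-values are computed for many terms and then corrected for multiple testing before terms are called significant.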

Pathway-level analysis
We analysed the CAKUT omics data for overrepresented pathways within the WikiPathways database 9. From 634 pathways in the database, 38 pathways were overrepresented and had a link between miRNA and protein (or peptide mapped to protein) based on the CAKUT patient data. In these pathways, we found 15 links between miRNome and proteome where both interaction partners are significantly differentially expressed. The "PI3K-Akt Signalling Pathway" (WikiPathways: WP4172) 10 is a major regulator of the cell cycle and it contained five links between miRNAs and the peptidome or proteome (Fig. 4A). The 10 remaining links between miRNAs and proteins are indicated in Fig. 4B. The PI3K-Akt pathway also includes certain collagens that had been associated with CAKUT in the original study 4. However, we could not identify any significant links between these proteins and the miRNome. Instead, in this pathway, we found significant links between four gene products (CSF1, IGF2, ITGB1, and RAC1) and five miRNAs (hsa-miR-130a-3p, hsa-miR-1207-5p, hsa-miR-125b-5p, hsa-miR-134-5p, and hsa-miR-320a). A significant link indicates that a differentially expressed miRNA binds to the mRNA of a differentially expressed protein, indicating a regulatory connection.
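The definition of a significant link given above (both the miRNA and its target are differentially expressed) can be expressed as a simple filter over known miRNA-target interactions. The identifiers, p-values, and the 0.05 threshold below are hypothetical placeholders; the text does not state the significance criterion used.

```python
def significant_links(interactions, mirna_p, protein_p, alpha=0.05):
    """Keep miRNA-target pairs where both partners pass the threshold.

    interactions: iterable of (miRNA, target gene) pairs from a knowledge base
    mirna_p:      differential-expression p-value per miRNA
    protein_p:    differential-expression p-value per gene product
    alpha:        significance threshold (assumed value)
    Partners without a measurement are treated as not significant.
    """
    return [
        (mirna, target)
        for mirna, target in interactions
        if mirna_p.get(mirna, 1.0) < alpha and protein_p.get(target, 1.0) < alpha
    ]


# Hypothetical inputs, for illustration only.
interactions = [("miR-A", "GENE1"), ("miR-A", "GENE2"), ("miR-B", "GENE1"), ("miR-C", "GENE3")]
mirna_p = {"miR-A": 0.01, "miR-B": 0.20, "miR-C": 0.03}
protein_p = {"GENE1": 0.02, "GENE2": 0.40}

links = significant_links(interactions, mirna_p, protein_p)
```

Here only the pair whose miRNA and target both fall below the threshold is retained; a pair with an unmeasured target (GENE3) is dropped, mirroring the grey nodes without expression data in the pathway figure.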

Discussion
Our main result is the identification of affected pathways and networks in CAKUT, which contributes to a better understanding of its pathophysiology. We achieved this by re-using existing data, combining them with new data (miRNome), and using available data from knowledge bases for analysis, as a strategy to overcome the notorious shortage of data for rare diseases. The three bioinformatics analyses pointed towards an important role for collagen and the PI3K-AKT signalling pathway in CAKUT development. Additionally, several key genes (CSF1, IGF2, ITGB1, and RAC1) and microRNAs were identified. Finally, driven by the EJP RD project, we applied open science and the FAIR principles to multi-omics rare disease data sets to facilitate their integration and analysis with relevant external data resources and to support their reusability by the scientific community for (rare) disease research.
Using the output from the mixOmics approach, a network related to collagen and cytoskeleton remodelling was identified, consisting of COL3A1, COL18A1, TMSB4X, and COL1A1, together with two smaller networks including COL1A2 and COL4A1. COL3A1, COL18A1, COL1A1, COL1A2, and COL4A1 are collagens, and TMSB4X is a G-actin binding protein involved in cytoskeleton formation. In detail, COL3A1 is involved in blood vessel formation and, if mutated, can cause a vascular type of Ehlers-Danlos syndrome 11. COL4A1 is also involved in angiogenesis and, if mutated, can cause several types of hereditary angiopathies. In a study by Plaisier et al. 12, basement membrane defects in kidney and skin were detected in patients with mutations in COL4A1. Animal models typically show defects in blood vessel stability, frequently resulting in perinatal cerebral hemorrhage but also in eye and kidney malformations 13. COL18A1 is involved in Knobloch syndrome 1, which is characterised by malformations of the eye and glaucoma 14. Several studies on animal models report abnormal eye, head, and heart formation, and one study also reported abnormal kidney filtration capacity in a mouse model 15. Mutations in COL1A1 can cause several forms of osteogenesis imperfecta, variations of Ehlers-Danlos syndrome, and other bone mineral density variation disorders 16. Mouse models exist; their phenotype is characterised by a high occurrence of bone fractures 17. Mutations in COL1A2 can also cause several forms of osteogenesis imperfecta, as well as the cardiac valvular type of Ehlers-Danlos syndrome 18. Neither COL1A1 nor COL1A2 has been linked to renal abnormalities before. For TMSB4X, no clear links to diseases are known. As a G-actin binding protein involved in cytoskeleton formation and maintenance, it was shown in vitro to be essential for coronary vessel development and cell migration 19.
In this analysis, we used a supervised classification approach from the mixOmics package, which requires matching samples among omics data sets. Since the number of overlapping samples in all sets decreased when the proteomics data were included in the mixOmics-based analysis, we decided to exclude the proteomics data for this specific analysis. MixOmics proposes two approaches, sPLS-DA and PLS-DA. The difference between sPLS-DA and PLS-DA was insignificant, probably because sPLS-DA is expected to be beneficial over PLS-DA for high-dimensional data 20. Additionally, mixOmics analysis allowed the identification of a collagen-related cluster solely based on the peptidome and miRNome data. The main variance in the data stemmed from a range of miRNAs that could be connected to a small number of peptides (Fig. 2B). Most of these relations were positive correlations, while only hsa-miR-6768-5p and COL1A1_pep30 (ADGQpGAKGEpGDAGAKGDAGPpGP) had a negative correlation. hsa-miR-6768-5p has not been previously identified or predicted to affect COL1A1. While an important role for collagen in CAKUT was previously established 4,12,21, COL1A1 has not specifically been linked to CAKUT. Furthermore, this work highlights potential novel miRNA and peptide relations, which might be relevant to study in order to get a better understanding of CAKUT.
Unsupervised joint dimensionality reduction analysis with the momix notebook identified the most relevant molecules from the three omics data sets. We further selected the results of the three best performing joint dimensionality reduction methods among the eight tested methods. From the proteome analysis, CDH6 (P55285), CDH9 (Q9ULB4), and CDH10 (Q9Y6N8) are particularly interesting, as these cadherins regulate hippo signalling, which plays a role in kidney and urinary tract development (Fig. 3E) 22,23. Furthermore, UMOD (P07911) was previously associated with medullary cystic kidney disease, familial juvenile hyperuricemic nephropathy, and glomerulocystic kidney disease 24. Whether mutations in UMOD are a cause of CAKUT is still under debate 24. The peptide analysis revealed COL4A5 (P29400) as an interesting protein, as it is one of the glomerular basement membrane proteins that cause Alport syndrome 25. COL4A1 (P02462) is also of interest. This protein is identified by all three best performing methods through different peptides (MOFA and RGCCA found the peptide COL4A1_pep1, tICA found the peptide COL4A1_pep2) (Supplemental Table 1), and it is associated with kidney diseases 12,21,26. Among the enriched annotation terms, cell-cell adhesion and extracellular matrix organization are known to play a role in ureteric bud branching 27.
The momix and mixOmics workflows overlap in the identified molecules of interest, including COL1A1, COL1A2, COL3A1, and COL18A1. In the GO enrichment analysis, we obtained different annotation terms, indicating that the momix and mixOmics approaches are complementary.
The pathway-level analysis used the molecular interactions of WikiPathways, a pathway database extended with miRNA-target information, as a backbone to investigate the interactions of interest. The advantage of this method is that it integrates prior knowledge into the analysis, which is especially important when the signal extracted from the data is low. Using this pathway analysis method, we identified 15 functional links between significantly differentially expressed proteins and the miRNome. The PI3K-AKT signalling pathway hosts five of these interactions between the different omics data sets, making this the most relevant pathway for CAKUT disease progression (Fig. 4A). In addition, it harbours several collagen proteins that were also identified by the other methods. Involvement of the PI3K-AKT pathway showed up in a study on transcriptomics data of CAKUT patients 28 and, via the MDM2 gene, in another study using miRNA data 29. Collagen modifications have been associated with the development of CAKUT 4,21,30. Whether collagens are causally involved remains to be determined. Kitzler et al. described that COL4A1 variants could be a potential novel cause of autosomal dominant CAKUT in humans, leading predominantly to vesicoureteral reflux and an isolated (nonsyndromic) CAKUT phenotype 21. Variants in different extracellular matrix (ECM) proteins or proteins that interact with the ECM have been described 30. Collagens make up a large part of the ECM, and remodelling of the ECM, potentially due to such variants, is likely reflected by changes observed in collagen fragments in amniotic fluid. In addition, it is likely that the increased abundance of collagen fragments in amniotic fluid represents ECM remodelling due to kidneys with dys-/hypoplasia, cysts, and hyperechogenicity, even without gene variants that specifically target the ECM (e.g. HNF1B variants) 4.
The other interactions, from among others the "Focal Adhesion" (WikiPathways: WP306) or "Senescence and Autophagy" (WikiPathways: WP28806) pathways, are shown in Fig. 4B. The limitation of the pathway-level analysis is the dependence on knowledge databases of molecular interactions. Nonetheless, for both pathways and miRNA-target interactions, there are several options regarding analysis. On the one hand, WikiPathways is an open, community created, and expert curated database 9. The contributions that define the content are dependent on published literature, and the pathways undergo regular curation to be updated with current findings. On the other hand, miRTarBase is a miRNA-target interaction database that provides manually selected, experimentally validated miRNA-target interactions from published literature 31. Integrating analysis methods using these and other databases to cross-validate the information measured on patient material is important to draw relevant conclusions for disease research.
Altogether, the different bioinformatics strategies and methods presented in this study offer a complementary spectrum of possible multi-omics strategies, which can be used for the analysis of rare disease data sets. Notably, most of these methods identified the same (functional) group of genes, with differences in the weighting of correlation statistics or the use of methods supported by prior knowledge. Importantly, methods based on mathematical analysis, ignoring existing biomedical knowledge, allow us to identify potentially interesting findings in a hypothesis-free manner. Pathways, or approaches based on prior knowledge in general, allow us to select functional and molecular interactions from the given data to support a biomedical interpretation of the results. We demonstrated that a combination of these strategies is advantageous for the analysis of (multi-)omics data in the field of rare diseases.
There is an increasing demand for open science, which requires making the data, analysis tools, and whole workflows FAIRly available together with the results. This significantly increases the possibility to reproduce results and counteracts the current crisis in reproducibility and trust in scientific studies. This demand is especially high in the rare disease field, where the naturally limited number of patients, samples, and data has long encouraged international and interdisciplinary collaborations to pool data and exchange methods for dealing with low sample numbers. To this end, we hope to aid research reliability and reproducibility by providing both FAIR metadata and workflows as presented in this study and supported by the EJP RD.
In summary, we provided several complementary bioinformatics strategies and their results that, in combination, could identify biologically relevant molecules, pathways, and networks from multi-omics rare disease data sets in both an unsupervised and a supervised manner. The identified proteins, peptides, and miRNAs highlight modules relevant for CAKUT disease, and they can be used for future investigations and experimental validation. Finally, the application of open science and FAIR principles in this study contributes to the transparency and reusability of data and workflows in, but not limited to, the rare disease field.

Multi-omics data sets
The CAKUT multi-omics data set was obtained from a previously published study and reinvestigated in collaboration with the authors of the original study 4,32. The ethical approval was given by the patient protection committee of the French south-west and overseas-1 departments (approval number DC-2016-2611). The initial study contains proteome and peptidome data from amniotic fluid samples. Here we added novel miRNome data from amniotic fluid samples, which were derived from the same patients as described below. As stated in the previously published studies, the study protocol was approved by the national ethics committees (France, RCB 2010-AO1151-38; Belgium S 55406 and B32220096569), and informed consent was obtained from all participants. All experiments were performed in accordance with relevant named guidelines and regulations. In total, 162 individuals were studied, of which 104 samples had a clear postnatal outcome. Patients were diagnosed with either non-severe CAKUT, that is, patients with a normal GFR (glomerular filtration rate), moderately

Fig. 2. Integrative analysis of miRNome and peptidome data to identify combinations of variables from both omics data sets in comparison to the single-omics analysis using only peptidome data. (A) Multi-omics integration of miRNome and peptidome data using the block sPLS-DA method of the mixOmics package. Peptidome and miRNome data were matched by patient. Variates 1 and 2 indicate different latent components, where both peptidome and miRNome data are projected onto a smaller 5-dimensional subspace (see "Methods"). (B) Circos plot of correlations based on the sPLS-DA results using the miRNome (yellow) and peptidome (green) data of the first two components. miRNAs are indicated by their hsa-miR identifiers. Peptides are mapped to their respective proteins and multiple matches to the same protein are shown with numbered suffixes. The exact peptide sequences can be found in Supplemental Table 1. Only correlations scoring above 0.80 are shown. (C) Network-based integration of the miRNome, peptidome, and proteome data sets to depict the most relevant molecules identified by the mixOmics approach. The network is composed of the most relevant miRNAs (yellow) and peptides (green) based on sPLS-DA analysis as described in the "Methods" section. In this case, the peptide sequences were used to map peptides to proteins (blue) using sequence alignment (see "Methods"). Peptides and miRNAs are indicated as in (B). The larger network is a collagen and cytoskeleton network consisting of COL3A1, COL18A1, TMSB4X involved in cytoskeleton organisation, and COL1A1. The two smaller networks also include COL1A2 and COL4A1. (D) Unsupervised analysis between miRNAs and peptides displayed by a heatmap. The colours are based on their contributions to the first two components. Only miRNAs and peptides with correlations above 0.80 are shown.

Fig. 3. Joint multi-omics dimensionality reduction analysis. (A-C) Projections of all samples on the first two factors obtained by (A) RGCCA, (B) tICA, and (C) MOFA. (D-F) Overlap of the top 5% peptides, proteins, and miRNAs selected by RGCCA, tICA, and MOFA analysis. (G) GO Biological Process enrichment analysis results of the features selected from different omics data by multiple methods. The significant results from different omics are filtered and integrated by orsum. The rank quartiles of the significant terms are coloured for the specific data sets. Enrichment scores can be found in Supplementary Table 5. (H) Reactome enrichment analysis results of the features selected from different omics data by multiple methods (there is no enrichment result for genes selected from the proteome data). The significant results from different omics are filtered and integrated by orsum. The colours indicate the quartile of the rank of the significant term for the specific data set. Colours as in (G).

Fig. 4. Pathway enrichment analysis. Visualisation of the interacting differentially expressed proteins/peptides/miRNAs in the WikiPathways pathway database on the combined miRNome, peptidome, and proteome data. Rectangular nodes represent protein products as determined from the peptidome and proteome; ellipses represent miRNAs indicated by their hsa-miR identifiers. (A) Visualisation of the PI3K-Akt Signalling Pathway as adjusted from WikiPathways (WikiPathways: WP4172). Only a part of the larger pathway is shown to emphasise the section where most differential expression occurred. Blue indicates downregulation and red upregulation, as indicated by the gradient bar. Asterisks indicate the enrichment significance (p-value). Grey nodes mean that no expression data were found. On the one hand, we found that IGF2, ITGB1, and RAC1 were upregulated in the same direction as their miRNAs. On the other hand, CSF1 was downregulated in contrast to its targeting miRNAs, which were both upregulated. (B) The 10 remaining significantly linked miRNAs and proteins, from the 15 interactions that were identified in total. The gene products were selected only when either peptidome or proteome indicated significant levels of differential regulation, as did the significant miRNAs targeting them.

Table 2. Accuracy of k-means clustering runs on each of the two factors calculated by joint multi-omics dimensionality reduction methods. The bold numbers represent the higher accuracy obtained by each method.